Geodemographic Analysis with PySAL and scikit-learn

Here, we'll examine geodemographic clustering in Los Angeles County

Data Prep

Geodemographic Clusters

Geodemographic analysis applies unsupervised learning to demographic and socioeconomic data, then examines the spatial pattern of the resulting clusters.
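As a minimal sketch of the unsupervised-learning step, the snippet below standardizes a small table of indicators and runs k-means on it with scikit-learn. The data are synthetic stand-ins (two hypothetical columns, roughly "income" and "share of some group"), not the Los Angeles tracts used in this notebook, and k-means is only one of several clusterers that could fill this role.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in for tract-level indicators: two well-separated groups
group_a = rng.normal(loc=[30_000, 0.2], scale=[5_000, 0.05], size=(50, 2))
group_b = rng.normal(loc=[90_000, 0.7], scale=[5_000, 0.05], size=(50, 2))
X = np.vstack([group_a, group_b])

# Standardize so variables measured on very different scales contribute equally
X_std = StandardScaler().fit_transform(X)

# Fit k-means; each row (tract) receives an integer cluster label
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_std)
labels = km.labels_
```

Joining `labels` back onto a GeoDataFrame of tracts is what makes the spatial half of the analysis (mapping the clusters) possible.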

There are some obvious spatial patterns (which we might expect, given the results of our prior esda and segregation analyses). But what do these clusters mean? What kinds of demographic features do they represent?

This table is a lot to interpret at once, so a visualization would be handy. Violin plots are a nice way of examining how each of the input variables is distributed in each of the resulting clusters

We can also use a statistic to tell us how well this model fits the data. To do so, we can use scikit-learn's silhouette score

The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of. Note that Silhouette Coefficient is only defined if number of labels is 2 <= n_labels <= n_samples - 1.

This function returns the mean Silhouette Coefficient over all samples. To obtain the values for each sample, use silhouette_samples.

The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.
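To make the definition concrete, here is a small pure-Python sketch of the per-sample formula (b - a) / max(a, b) using Euclidean distance. The function name `silhouette_samples_py` and the toy points are illustrative assumptions; in practice scikit-learn's `silhouette_score` and `silhouette_samples` do this work.

```python
from math import dist
from statistics import mean


def silhouette_samples_py(points, labels):
    """Per-sample silhouette values, (b - a) / max(a, b), with Euclidean distance."""
    clusters = sorted(set(labels))
    if not 2 <= len(clusters) <= len(points) - 1:
        raise ValueError("need 2 <= n_labels <= n_samples - 1")
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the sample's own cluster
        own = [dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
               if l == lab and j != i]
        if not own:
            # A singleton cluster has no intra-cluster distance; score it 0 here
            scores.append(0.0)
            continue
        a = mean(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            mean(dist(p, q) for q, l in zip(points, labels) if l == other)
            for other in clusters if other != lab
        )
        scores.append((b - a) / max(a, b))
    return scores


# Two tight pairs of points far apart from each other
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_samples_py(points, [0, 0, 1, 1])  # labels match the pairs
bad = silhouette_samples_py(points, [0, 1, 0, 1])   # labels split each pair
```

With labels that match the two pairs, every sample scores close to 1; with labels that split each pair, every sample scores below 0, which is the "assigned to the wrong cluster" case described above.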

What about other clustering algorithms or other numbers for k? Might we get a better model fit?
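One way to explore this question is to refit the model over a range of k and compare silhouette scores. The sketch below does so with k-means on synthetic data containing three well-separated blobs; the data, the range of k, and the variable names are assumptions for illustration, not the notebook's own inputs.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)

# Synthetic data: three compact blobs, so k=3 is the "true" structure
centers = np.array([[0, 0], [6, 0], [0, 6]])
X = np.vstack([rng.normal(c, 0.5, size=(40, 2)) for c in centers])

# Fit k-means for several values of k and record the mean silhouette score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest score is the best fit by this criterion
best_k = max(scores, key=scores.get)
```

The same loop works for any estimator that produces labels, so swapping `KMeans` for another clusterer gives a like-for-like comparison on the same score.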

This will create a linked holoviews plot so we can zoom in on both maps together (click the "wheel zoom" button on the bokeh plot so you can zoom in)

The silhouette score tells us that the affinity propagation clusterer provided a better solution. Nonetheless, we end up with similar spatial patterns

Spatially-Constrained Geodemographics (Regionalization)

Above, we notice there are some obvious spatial patterns in the neighborhood clusters. That happens due to the underlying spatial autocorrelation in the race and class indicators we used to develop the clusters. Instead of allowing this autocorrelation to "fall out" of the results, we can leverage it to create spatially-contiguous clusters

scikit-learn's agglomerative clustering algorithm accepts a connectivity constraint, which can be supplied as the sparse matrix representation of a PySAL W object. Let's compare solutions with and without the constraint

Why is the silhouette score higher for the first solution?
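The comparison can be sketched as follows. To stay self-contained, this example builds the adjacency matrix for a regular 10 x 10 grid with scikit-learn's `grid_to_graph` instead of using real tract geometries; with PySAL, the `sparse` attribute of a W object would be passed as `connectivity` in the same way (an assumption here, since no W is constructed). The attribute data are random, and `is_contiguous` is a hypothetical helper written for this sketch.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.image import grid_to_graph
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
n_side = 10

# Hypothetical standardized attributes for 100 "tracts" laid out on a grid
X = rng.normal(size=(n_side * n_side, 3))

# Sparse adjacency matrix: each cell is linked to its edge-sharing neighbors
connectivity = grid_to_graph(n_side, n_side).tocsr()

# Same algorithm and k, with and without the spatial constraint
unconstrained = AgglomerativeClustering(n_clusters=5, linkage="ward").fit(X)
constrained = AgglomerativeClustering(
    n_clusters=5, linkage="ward", connectivity=connectivity
).fit(X)

score_unconstrained = silhouette_score(X, unconstrained.labels_)
score_constrained = silhouette_score(X, constrained.labels_)


def is_contiguous(labels, graph):
    """Return True if every cluster forms a single connected piece of the graph."""
    for lab in np.unique(labels):
        idx = np.flatnonzero(labels == lab)
        n_parts, _ = connected_components(graph[idx][:, idx], directed=False)
        if n_parts != 1:
            return False
    return True


constrained_is_contiguous = is_contiguous(constrained.labels_, connectivity)
```

The constraint only allows adjacent clusters to merge, so the constrained solution is built from connected regions, whereas the unconstrained one is free to group similar but distant observations.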

Exercise

  1. Create two geodemographic typologies for Orange County using the same race and class variables as above
    • for the first, use 5 clusters
    • for the second, use 8 clusters
    • which solution is better?
  2. Create a geodemographic typology for Riverside County using Affinity Propagation with damping=0.8 and preference=-100
    • How many unique clusters do you find?
    • What is the average home price for tracts in Cluster 3?
  3. What would happen if you created a spatially-constrained geodemographic typology using DistanceBand spatial weights?